Ensure proper shutdown of libbeat #1075

McStork · 2016-03-01T09:30:41Z

Libbeat publisher now exits once all async events are published, all
events outputed and ConnectionModes closed.

Move WorkerSignal to libbeat/common: WorkerSignals are used in packages publisher and outputs
Add a second WaitGroup to WorkerSignal. WaitGroups are used to track events and shutdown of goroutines
Remove TestMessageWorkerStopQueue and stopQueue function in WorkerSignal: events are not considered
failed at shutdown anymore
Adapt existing Libbeat tests
Add a few tests for published events after shutdown
Add method Close to outputs/Outputer. The method is called by publisher/outputWorker onStop()
Edit CHANGELOG

McStork · 2016-03-01T09:31:57Z

Could not reopen previous PR due to forced push

elasticsearch-release · 2016-03-01T09:32:01Z

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run; then say 'jenkins, test it'.

ruflin · 2016-03-01T10:31:02Z

@McStork Could this also resolve #1074 ?

McStork · 2016-03-01T12:36:17Z

@ruflin No, I doubt it will help fix this race issue :(
Seems like the race is deep in ProtocolClient.

urso · 2016-03-02T11:19:41Z

libbeat/outputs/mode/balance.go

-			return false
-		}
+		ch <- msg
+		return true


this channel might block. Why remove the Done channel here?

Actually it should have the same behavior with or without this case <-done because here we make sure to output all events received.
The publisher being closed, balance forwardEvent() will not be called once we reach WGevent.Wait().

problem is, if output is unresponsive (e.g. infinite retry + network issues), the publisher will basically hang on shutdown doing no progress.

…down) Libbeat publisher now exits once all async events are published, all events outputed and ConnectionModes closed. * Move WorkerSignal to libbeat/common: WorkerSignals are used in packages publisher and outputs * Add a second WaitGroup to WorkerSignal. WaitGroups are used to track events and shutdown of Go routines * Add methods in WorkerSignal that wrap WaitGroups methods call * Remove TestMessageWorkerStopQueue and stopQueue function in WorkerSignal: events are not considered failed at shutdown anymore * Add tests for published events after shutdown * Add method Close to outputs/Outputer. The method is called by publisher/outputWorker onStop() * Edit CHANGELOG

McStork · 2016-03-04T09:29:36Z

Squashed and rebased after review changes

urso · 2016-03-04T14:57:34Z

LGTM.

Problem with this approach is, beats won't shutdown if output is unresponsive. But this is already the case, no regression here.

In future I'd like to have publisher.Client return a new client instance connecting to the publisher pipeline, which is closable.

So beats startup goes like:

start publisher pipeline
initialize beat itself
create (connect) client to pipeline

On shutdown we can disconnect from pipeline, so publisher pipeline and beats can shutdown in parallel. For filebeat the outputs.Signaler shall get a new event 'Stopped', so filebeat signal handler can differentiate between Fail and Shutdown handling for proper registry update on shutdown. With this change beats don't have to wait for pipeline being empty on shutdown, as pipeline is just dropped as is (events will be lost, yes). In addition libbeat might get support for internal persistent queue, to not drop events on shutdown (or in general).

Ensure proper shutdown of libbeat

McStork · 2016-03-04T16:04:50Z

@urso Thanks for giving your insight.

If I understand correctly, with a feature such as a persistent queue, a good solution would be to save unpublished events to disk at shutdown. A persistent queue would certainly help when getting stuck because of unresponsive output.

But is it be necessary do like this 100% of the time? What if:

If at shutdown the output gets stuck, events are on the disk anyway so none are lost and the program closes.
If at shutdown the output still forward events, wait for it to empty all events (or forward them until we reach a situation such as 1.)

The thing would be to measure the responsiveness of the output to know if we are in situation 1 or 2.

McStork · 2016-03-06T15:07:41Z

I think WorkerSignal can help solve the problem of output blocking the shutdown of a beat.
WorkerSignal could have a StopState channel that notify the connection modes that the beat is in shutdown mode.
StopState is closed when calling WorkerSignal.Stop(). The connection mode knowing it is in shutdown state, will drop events when he meets a failed connection/publishing attempts. WorkerSignal then ends without blocking.

I tried this today and it seems like it could work well :).

McStork · 2016-03-06T15:18:44Z

The StopState channel doesn't have to be in WorkerSignal. It can just be in PublisherType and closed before calling pub.wsOutput.Stop().

urso · 2016-03-06T20:44:27Z

If I understand correctly, with a feature such as a persistent queue, a good solution would be to save unpublished events to disk at shutdown. A persistent queue would certainly help when getting stuck because of unresponsive output.

Persistent queue is not about shutdown only, but more about resiliency. Having configured a persistent queue all events must pass the queue and only once ACKed by the queue, events will be published (with send-at-leat-once semantics).

Thing is, there is no paticular need to add too much logic for handling shutdown. For most beats it's fine to drop data. Beats not being fine with event loss (e.g. filebeat) remember where they left off and can restart from last location. No problem if queue is dropped. Still it's of interest to signal 'shutting' down events to filebeat to do one last registry update.

The thing would be to measure the responsiveness of the output to know if we are in situation 1 or 2.

I won't even try. Just drop events in pipeline and continue where left off. Filebeat/Winlogbeat or persistent queue remember where they left off and resend events not ACKed by output plugin.

The StopState channel doesn't have to be in WorkerSignal ...

Beats can already provide callback to PublishEvent (type output.Signaler) for getting Success/Fail events. This is used by filebeat async publisher mode:

Current problem is I can not distinguish between failure (which should only be possible during shutdown) or shutdown. Adding a shutdown callback we could do a 'panic' in 'Failed' handler.

Will be in vacation for next 2 weeks. Let's continue discussion thereafter.

ruflin added review libbeat labels Mar 1, 2016

urso reviewed Mar 2, 2016
View reviewed changes

McStork force-pushed the libbeat-proper-shut branch from 485186e to 9cc4581 Compare March 4, 2016 08:54

McStork force-pushed the libbeat-proper-shut branch from 9cc4581 to 7f16e11 Compare March 4, 2016 09:23

urso pushed a commit that referenced this pull request Mar 4, 2016

Merge pull request #1075 from McStork/libbeat-proper-shut

6d6cd52

Ensure proper shutdown of libbeat

urso merged commit 6d6cd52 into elastic:master Mar 4, 2016

McStork deleted the libbeat-proper-shut branch March 6, 2016 10:40

urso mentioned this pull request Apr 19, 2016

Add support for closing publisher.Client #1402

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure proper shutdown of libbeat #1075

Ensure proper shutdown of libbeat #1075

McStork commented Mar 1, 2016

McStork commented Mar 1, 2016

elasticsearch-release commented Mar 1, 2016

ruflin commented Mar 1, 2016

McStork commented Mar 1, 2016

urso Mar 2, 2016

McStork Mar 2, 2016

McStork Mar 2, 2016

urso Mar 2, 2016

McStork commented Mar 4, 2016

urso commented Mar 4, 2016

McStork commented Mar 4, 2016

McStork commented Mar 6, 2016

McStork commented Mar 6, 2016

urso commented Mar 6, 2016

Ensure proper shutdown of libbeat #1075

Ensure proper shutdown of libbeat #1075

Conversation

McStork commented Mar 1, 2016

McStork commented Mar 1, 2016

elasticsearch-release commented Mar 1, 2016

ruflin commented Mar 1, 2016

McStork commented Mar 1, 2016

urso Mar 2, 2016

Choose a reason for hiding this comment

McStork Mar 2, 2016

Choose a reason for hiding this comment

McStork Mar 2, 2016

Choose a reason for hiding this comment

urso Mar 2, 2016

Choose a reason for hiding this comment

McStork commented Mar 4, 2016

urso commented Mar 4, 2016

McStork commented Mar 4, 2016

McStork commented Mar 6, 2016

McStork commented Mar 6, 2016

urso commented Mar 6, 2016